{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "1Pi_B2cvdBiW"
      },
      "source": [
        "##### Copyright 2018 The TF-Agents Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "nQnmcm0oI1Q-"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "3NFuTvWVZG_B"
      },
      "source": [
        "# Policies\n",
        "\n",
        "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/agents/tutorials/3_policies_tutorial\"\u003e\n",
        "    \u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003e\n",
        "    View on TensorFlow.org\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/agents/blob/master/docs/tutorials/3_policies_tutorial.ipynb\"\u003e\n",
        "    \u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003e\n",
        "    Run in Google Colab\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/agents/blob/master/docs/tutorials/3_policies_tutorial.ipynb\"\u003e\n",
        "    \u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003e\n",
        "    View source on GitHub\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/agents/docs/tutorials/3_policies_tutorial.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "\u003c/table\u003e"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "31uij8nIo5bG"
      },
      "source": [
        "## Introduction"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "PqFn7q5bs3BF"
      },
      "source": [
        "In Reinforcement  Learning terminology, policies map an observation from the environment to an action or a distribution over actions. In TF-Agents, observations from the environment are contained in a named tuple `TimeStep('step_type', 'discount', 'reward', 'observation')`, and policies map timesteps to actions or distributions over actions. Most policies use  `timestep.observation`, some policies use `timestep.step_type` (e.g. to reset the state at the beginning of an episode in stateful policies), but `timestep.discount` and `timestep.reward` are usually ignored.\n",
        "\n",
        "Policies are related to other components in TF-Agents in the following way. Most policies have a neural network to compute actions and/or distributions over actions from TimeSteps. Agents can contain one or more policies for different purposes, e.g. a main policy that is being trained for deployment, and a noisy policy for data collection. Policies can be saved/restored, and can be used indepedently of the agent for data collection, evaluation etc.\n",
        "\n",
        "Some policies are easier to write in Tensorflow (e.g. those with a neural network), whereas others are easier to write in Python (e.g. following a script of actions). So in TF agents, we allow both Python and Tensorflow policies. Morever, policies written in TensorFlow might have to be used in a Python environment, or vice versa, e.g. a TensorFlow policy is used for training but later deployed in a production python environment. To make this easier, we provide wrappers for converting between python and TensorFlow policies.\n",
        "\n",
        "Another interesting class of policies are policy wrappers, which modify a given policy in a certain way, e.g. add a particular type of noise, make a greedy or epsilon-greedy version of a stochastic policy, randomly mix multiple policies etc.  "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "HdnG_TT_amWH"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "9Meq2nT_aquh"
      },
      "source": [
        "If you haven't installed tf-agents yet, run:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "xsLTHlVdiZP3"
      },
      "outputs": [],
      "source": [
        "!pip install tf-agents"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "sdvop99JlYSM"
      },
      "outputs": [],
      "source": [
        "from __future__ import absolute_import\n",
        "from __future__ import division\n",
        "from __future__ import print_function\n",
        "\n",
        "import abc\n",
        "import tensorflow as tf\n",
        "import tensorflow_probability as tfp\n",
        "import numpy as np\n",
        "\n",
        "from tf_agents.specs import array_spec\n",
        "from tf_agents.specs import tensor_spec\n",
        "from tf_agents.networks import network\n",
        "\n",
        "from tf_agents.policies import py_policy\n",
        "from tf_agents.policies import random_py_policy\n",
        "from tf_agents.policies import scripted_py_policy\n",
        "\n",
        "from tf_agents.policies import tf_policy\n",
        "from tf_agents.policies import random_tf_policy\n",
        "from tf_agents.policies import actor_policy\n",
        "from tf_agents.policies import q_policy\n",
        "from tf_agents.policies import greedy_policy\n",
        "\n",
        "from tf_agents.trajectories import time_step as ts\n",
        "\n",
        "tf.compat.v1.enable_v2_behavior()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "NyXO5-Aalb-6"
      },
      "source": [
        "## Python Policies"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "DOtUZ1hs02bu"
      },
      "source": [
        "The interface for Python policies is defined in `policies/py_policy.PyPolicy`. The main methods are:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "4PqNEVls1uqc"
      },
      "outputs": [],
      "source": [
        "class Base(object):\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def __init__(self, time_step_spec, action_spec, policy_state_spec=()):\n",
        "    self._time_step_spec = time_step_spec\n",
        "    self._action_spec = action_spec\n",
        "    self._policy_state_spec = policy_state_spec\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def reset(self, policy_state=()):\n",
        "    # return initial_policy_state.\n",
        "    pass\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def action(self, time_step, policy_state=()):\n",
        "    # return a PolicyStep(action, state, info) named tuple.\n",
        "    pass\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def distribution(self, time_step, policy_state=()):\n",
        "    # Not implemented in python, only for TF policies.\n",
        "    pass\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def update(self, policy):\n",
        "    # update self to be similar to the input `policy`.\n",
        "    pass\n",
        "\n",
        "  @property\n",
        "  def time_step_spec(self):\n",
        "    return self._time_step_spec\n",
        "\n",
        "  @property\n",
        "  def action_spec(self):\n",
        "    return self._action_spec\n",
        "\n",
        "  @property\n",
        "  def policy_state_spec(self):\n",
        "    return self._policy_state_spec"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "16kyDKk65bka"
      },
      "source": [
        "The most important method is `action(time_step)` which maps a `time_step` containing an observation from the environment to a PolicyStep named tuple containing the following attributes:\n",
        "\n",
        "*  `action`: The action to be applied to the environment.\n",
        "*  `state`: The state of the policy (e.g. RNN state) to be fed into the next call to action.\n",
        "*  `info`: Optional side information such as action log probabilities.\n",
        "\n",
        "The `time_step_spec` and `action_spec` are specifications for the input time step and the output action. Policies also have a `reset` function which is typically used for resetting the state in stateful policies. The `update(new_policy)` function updates `self` towards `new_policy`.\n",
        "\n",
        "Now, let us look at a couple of examples of python policies.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "YCH1Hs_WlmDT"
      },
      "source": [
        "### Example 1: Random Python Policy"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "lbnQ0BQ3_0N2"
      },
      "source": [
        "A simple example of a `PyPolicy` is the `RandomPyPolicy` which generates random actions for the discrete/continuous given action_spec. The input `time_step` is ignored."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "QX8M4Nl-_0uu"
      },
      "outputs": [],
      "source": [
        "action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)\n",
        "my_random_py_policy = random_py_policy.RandomPyPolicy(time_step_spec=None,\n",
        "    action_spec=action_spec)\n",
        "time_step = None\n",
        "action_step = my_random_py_policy.action(time_step)\n",
        "print(action_step)\n",
        "action_step = my_random_py_policy.action(time_step)\n",
        "print(action_step)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "B8WrFOR1lz31"
      },
      "source": [
        "### Example 2: Scripted Python Policy"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "AJ0Br1lGBnTT"
      },
      "source": [
        "A scripted policy plays back a script of actions represented as a list of `(num_repeats, action)` tuples. Every time the `action` function is called, it returns the next action from the list until the specified number of repeats is done, and then moves on to the next action in the list. The `reset` method can be called to start executing from the beginning of the list."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "_mZ244m4BUYv"
      },
      "outputs": [],
      "source": [
        "action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)\n",
        "action_script = [(1, np.array([5, 2], dtype=np.int32)), \n",
        "                 (0, np.array([0, 0], dtype=np.int32)), # Setting `num_repeates` to 0 will skip this action.\n",
        "                 (2, np.array([1, 2], dtype=np.int32)), \n",
        "                 (1, np.array([3, 4], dtype=np.int32))]\n",
        "\n",
        "my_scripted_py_policy = scripted_py_policy.ScriptedPyPolicy(\n",
        "    time_step_spec=None, action_spec=action_spec, action_script=action_script)\n",
        "\n",
        "policy_state = my_scripted_py_policy.get_initial_state()\n",
        "time_step = None\n",
        "print('Executing scripted policy...')\n",
        "action_step = my_scripted_py_policy.action(time_step, policy_state)\n",
        "print(action_step)\n",
        "action_step= my_scripted_py_policy.action(time_step, action_step.state)\n",
        "print(action_step)\n",
        "action_step = my_scripted_py_policy.action(time_step, action_step.state)\n",
        "print(action_step)\n",
        "\n",
        "print('Resetting my_scripted_py_policy...')\n",
        "policy_state = my_scripted_py_policy.get_initial_state()\n",
        "action_step = my_scripted_py_policy.action(time_step, policy_state)\n",
        "print(action_step)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "3Dz7HSTZl6aU"
      },
      "source": [
        "## TensorFlow Policies"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "LwcoBXqKl8Yb"
      },
      "source": [
        "TensorFlow policies follow the same interface as Python policies. Let us look at a few examples."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "3x8pDWEFrQ5C"
      },
      "source": [
        "### Example 1: Random TF Policy\n",
        "\n",
        "A RandomTFPolicy can be used to generate random actions according to a given discrete/continuous `action_spec`. The input `time_step` is ignored.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "nZ3pe5G4rjrW"
      },
      "outputs": [],
      "source": [
        "action_spec = tensor_spec.BoundedTensorSpec(\n",
        "    (2,), tf.float32, minimum=-1, maximum=3)\n",
        "input_tensor_spec = tensor_spec.TensorSpec((2,), tf.float32)\n",
        "time_step_spec = ts.time_step_spec(input_tensor_spec)\n",
        "\n",
        "my_random_tf_policy = random_tf_policy.RandomTFPolicy(\n",
        "    action_spec=action_spec, time_step_spec=time_step_spec)\n",
        "observation = tf.ones(time_step_spec.observation.shape)\n",
        "time_step = ts.restart(observation)\n",
        "action_step = my_random_tf_policy.action(time_step)\n",
        "\n",
        "print('Action:')\n",
        "print(action_step.action)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "GOBoWETprWCB"
      },
      "source": [
        "### Example 2: Actor Policy\n",
        "\n",
        "An actor policy can be created using either a network that maps `time_steps` to actions or a network that maps `time_steps` to distributions over actions.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "2S94E5zQgge_"
      },
      "source": [
        "#### Using an action network"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "X2LM5STNgv1u"
      },
      "source": [
        "Let us define a network as follows:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "S2wFgzJFteQX"
      },
      "outputs": [],
      "source": [
        "class ActionNet(network.Network):\n",
        "\n",
        "  def __init__(self, input_tensor_spec, output_tensor_spec):\n",
        "    super(ActionNet, self).__init__(\n",
        "        input_tensor_spec=input_tensor_spec,\n",
        "        state_spec=(),\n",
        "        name='ActionNet')\n",
        "    self._output_tensor_spec = output_tensor_spec\n",
        "    self._sub_layers = [\n",
        "        tf.keras.layers.Dense(\n",
        "            action_spec.shape.num_elements(), activation=tf.nn.tanh),\n",
        "    ]\n",
        "\n",
        "  def call(self, observations, step_type, network_state):\n",
        "    del step_type\n",
        "\n",
        "    output = tf.cast(observations, dtype=tf.float32)\n",
        "    for layer in self._sub_layers:\n",
        "      output = layer(output)\n",
        "    actions = tf.reshape(output, [-1] + self._output_tensor_spec.shape.as_list())\n",
        "\n",
        "    # Scale and shift actions to the correct range if necessary.\n",
        "    return actions, network_state"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "k7fIn-ybVdC6"
      },
      "source": [
        "In TensorFlow most network layers are designed for batch operations, so we expect the input time_steps to be batched, and the output of the network will be batched as well. Also the network is responsible for producing actions in the correct range of the given action_spec. This is conventionally done using e.g. a tanh activation for the final layer to produce actions in [-1, 1] and then scaling and shifting this to the correct range as the input action_spec (e.g. see `tf_agents/agents/ddpg/networks.actor_network()`).\n",
        "\n",
        "Now, we can create an actor policy using the above network."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "0UGmFTe7a5VQ"
      },
      "outputs": [],
      "source": [
        "input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)\n",
        "time_step_spec = ts.time_step_spec(input_tensor_spec)\n",
        "action_spec = tensor_spec.BoundedTensorSpec((3,),\n",
        "                                            tf.float32,\n",
        "                                            minimum=-1,\n",
        "                                            maximum=1)\n",
        "\n",
        "action_net = ActionNet(input_tensor_spec, action_spec)\n",
        "\n",
        "my_actor_policy = actor_policy.ActorPolicy(\n",
        "    time_step_spec=time_step_spec,\n",
        "    action_spec=action_spec,\n",
        "    actor_network=action_net)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "xlmGPTAmfPK3"
      },
      "source": [
        "We can apply it to any batch of time_steps that follow time_step_spec:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "fvsIsR0VfOA4"
      },
      "outputs": [],
      "source": [
        "batch_size = 2\n",
        "observations = tf.ones([2] + time_step_spec.observation.shape.as_list())\n",
        "\n",
        "time_step = ts.restart(observations, batch_size)\n",
        "\n",
        "action_step = my_actor_policy.action(time_step)\n",
        "print('Action:')\n",
        "print(action_step.action)\n",
        "\n",
        "distribution_step = my_actor_policy.distribution(time_step)\n",
        "print('Action distribution:')\n",
        "print(distribution_step.action)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "lumtyhejZOXR"
      },
      "source": [
        "In the above example, we created the policy using an action network that produces an action tensor. In this case, `policy.distribution(time_step)` is a deterministic (delta) distribution around the output of `policy.action(time_step)`. One way to produce a stochastic policy is to wrap the actor policy in a policy wrapper that adds noise to the actions. Another way is to create the actor policy using an action distribution network instead of an action network as shown below."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "_eNrJ5gKgl3W"
      },
      "source": [
        "#### Using an action distribution network"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "sSYzC9LobVsK"
      },
      "outputs": [],
      "source": [
        "class ActionDistributionNet(ActionNet):\n",
        "\n",
        "  def call(self, observations, step_type, network_state):\n",
        "    action_means, network_state = super(ActionDistributionNet, self).call(\n",
        "        observations, step_type, network_state)\n",
        "\n",
        "    action_std = tf.ones_like(action_means)\n",
        "    return tfp.distributions.Normal(action_means, action_std), network_state\n",
        "\n",
        "\n",
        "action_distribution_net = ActionDistributionNet(input_tensor_spec, action_spec)\n",
        "\n",
        "my_actor_policy = actor_policy.ActorPolicy(\n",
        "    time_step_spec=time_step_spec,\n",
        "    action_spec=action_spec,\n",
        "    actor_network=action_distribution_net)\n",
        "\n",
        "action_step = my_actor_policy.action(time_step)\n",
        "print('Action:')\n",
        "print(action_step.action)\n",
        "distribution_step = my_actor_policy.distribution(time_step)\n",
        "print('Action distribution:')\n",
        "print(distribution_step.action)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "BzoNGJnlibtz"
      },
      "source": [
        "Note that in the above, actions are clipped to the range of the given action spec [-1, 1]. This is because a constructor argument of ActorPolicy clip=True by default. Setting this to false will return unclipped actions produced by the network. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "PLj6A-5domNG"
      },
      "source": [
        "Stochastic policies can be converted to deterministic policies using, for example, a GreedyPolicy wrapper which chooses `stochastic_policy.distribution().mode()` as its action, and a deterministic/delta distribution around this greedy action as its `distribution()`."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "4Xxzo2a7rZ7v"
      },
      "source": [
        "### Example 3: Q Policy"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "79eGLqpOhQVp"
      },
      "source": [
        "A Q policy is used in agents like DQN and is based on a Q network that predicts a Q value for each discrete action. For a given time step, the action distribution in the Q Policy is a categorical distribution created using the q values as logits.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Haakr2VvjqKC"
      },
      "outputs": [],
      "source": [
        "input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)\n",
        "time_step_spec = ts.time_step_spec(input_tensor_spec)\n",
        "action_spec = tensor_spec.BoundedTensorSpec((1,),\n",
        "                                            tf.int32,\n",
        "                                            minimum=-1,\n",
        "                                            maximum=1)\n",
        "num_actions = action_spec.maximum - action_spec.minimum + 1\n",
        "\n",
        "\n",
        "class QNetwork(network.Network):\n",
        "\n",
        "  def __init__(self, input_tensor_spec, action_spec, num_actions=num_actions, name=None):\n",
        "    super(QNetwork, self).__init__(\n",
        "        input_tensor_spec=input_tensor_spec,\n",
        "        state_spec=(),\n",
        "        name=name)\n",
        "    self._sub_layers = [\n",
        "        tf.keras.layers.Dense(num_actions),\n",
        "    ]\n",
        "\n",
        "  def call(self, inputs, step_type=None, network_state=()):\n",
        "    del step_type\n",
        "    inputs = tf.cast(inputs, tf.float32)\n",
        "    for layer in self._sub_layers:\n",
        "      inputs = layer(inputs)\n",
        "    return inputs, network_state\n",
        "\n",
        "\n",
        "batch_size = 2\n",
        "observation = tf.ones([batch_size] + time_step_spec.observation.shape.as_list())\n",
        "time_steps = ts.restart(observation, batch_size=batch_size)\n",
        "\n",
        "my_q_network = QNetwork(\n",
        "    input_tensor_spec=input_tensor_spec,\n",
        "    action_spec=action_spec)\n",
        "my_q_policy = q_policy.QPolicy(\n",
        "    time_step_spec, action_spec, q_network=my_q_network)\n",
        "action_step = my_q_policy.action(time_steps)\n",
        "distribution_step = my_q_policy.distribution(time_steps)\n",
        "\n",
        "print('Action:')\n",
        "print(action_step.action)\n",
        "\n",
        "print('Action distribution:')\n",
        "print(distribution_step.action)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Xpu9m6mvqJY-"
      },
      "source": [
        "## Policy Wrappers"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "OfaUrqRAoigk"
      },
      "source": [
        "A policy wrapper can be used to wrap and modify a given policy, e.g. add noise. Policy wrappers are a subclass of Policy (Python/TensorFlow) and can therefore be used just like any other policy. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "-JJVVAALqVNQ"
      },
      "source": [
        "### Example: Greedy Policy\n",
        "\n",
        "\n",
        "A greedy wrapper can be used to wrap any TensorFlow policy that implements `distribution()`. `GreedyPolicy.action()` will return `wrapped_policy.distribution().mode()` and `GreedyPolicy.distribution()` is a deterministic/delta distribution around `GreedyPolicy.action()`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "xsRPBeLZtXvu"
      },
      "outputs": [],
      "source": [
        "my_greedy_policy = greedy_policy.GreedyPolicy(my_q_policy)\n",
        "\n",
        "action_step = my_greedy_policy.action(time_steps)\n",
        "print('Action:')\n",
        "print(action_step.action)\n",
        "\n",
        "distribution_step = my_greedy_policy.distribution(time_steps)\n",
        "print('Action distribution:')\n",
        "print(distribution_step.action)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "TF-Agents Policies Tutorial.ipynb",
      "private_outputs": true,
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
